In your final group assignment you have to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. You can download AirBnB data from insideairbnb.com; it was originally scraped from airbnb.com.
The following Google sheet shows which cities you can use; please choose one of them and add your group name next to it, e.g., A7, B13. No city can have more than 2 groups per stream working on it; if this happens, I will allocate study groups to cities with the help of R’s sampling.
All of the listings are a GZ file, namely they are archive files compressed by the standard GNU zip (gzip) compression algorithm. You can download, save and extract the file if you wanted, but vroom::vroom() or readr::read_csv() can immediately read and extract this kind of a file. You should prefer vroom() as it is faster, but if vroom() is limited by a firewall, please use read_csv() instead.
vroom will download the *.gz zipped file, unzip, and provide you with the dataframe.
Even though there are many variables in the data frame, here is a quick description of some of the variables collected, and you can find a data dictionary here
price = cost per night
property_type: type of accommodation (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
longitude , latitude: geographical coordinates to help us locate the listing
neighbourhood*: three variables on a few major neighbourhoods in each city
In the R4DS Exploratory Data Analysis chapter, the authors state:
“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation… EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”
Conduct a thorough EDA. Recall that an EDA involves three things:
dplyr::glimpse()mosaic::favstats()skimr::skim()ggplot2::ggplot()
geom_histogram() or geom_density() for numeric continuous variablesgeom_bar() or geom_col() for categorical variablesGGally::ggpairs() for scatterplot/correlation matrix
aes call, for example: aes(colour = gender, alpha = 0.4)You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:
Using the glimpse() function with the dataset, there are 74 variables/columns and 17,703 rows/observations. The variables are split up in the following categories: character (24), date (5), logical (8), numeric (37)
The variables that are numbers can be found via the skim() function under “Variable type: numeric” and for instance are host_listings_count, latitude, accommodates, beds, minimum_nights and availability_60
The variables that are characters are those belonging to “Variable type: character” and are for instance name, listing_url, description and host_name. Those that are factor variables are labelled as “logical”, given they take a certain value based on a logical test. This is the case for bathrooms, host_is_superhost, instant_bookable. For a few of these variables there are numbers missing or NAs so we need to clean up data. To examine the distribution of categorical/factor variables, a bar chart looks like the best option
Below the analysis of raw values and summary statistics:
glimpse(listings) # We can see that a few variables have NA, N/A or values that depend on a logical testRows: 17,703
Columns: 74
$ id <dbl> 6400, 23986, 28300, 37256~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped <date> 2021-09-20, 2021-09-20, ~
$ name <chr> "The Studio Milan", "\" C~
$ description <chr> "Enjoy your stay at The S~
$ neighborhood_overview <chr> "The neighborhood is quie~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 13822, 95941, 121663, 119~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Francesca", "Jeremy", "M~
$ host_since <date> 2009-04-17, 2010-03-19, ~
$ host_location <chr> "Milan, Lombardia, Italy"~
$ host_about <chr> "I'm am Francesca Sottila~
$ host_response_time <chr> "N/A", "N/A", "N/A", "N/A~
$ host_response_rate <chr> "N/A", "N/A", "N/A", "N/A~
$ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "N/A~
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, TRUE~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Zona 5", "Navigli", "Cen~
$ host_listings_count <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_total_listings_count <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood <chr> "Milan, Lombardy, Italy",~
$ neighbourhood_cleansed <chr> "TIBALDI", "NAVIGLI", "SA~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude <dbl> 45.44195, 45.44991, 45.47~
$ longitude <dbl> 9.17797, 9.17597, 9.17359~
$ property_type <chr> "Private room in rental u~
$ room_type <chr> "Private room", "Entire h~
$ accommodates <dbl> 1, 4, 2, 1, 4, 4, 5, 3, 2~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "3.5 baths", "1 bath", "1~
$ bedrooms <dbl> 3, 1, 1, 1, 2, 2, 2, 2, 1~
$ beds <dbl> 1, 1, 2, 1, 4, 2, 3, 1, 1~
$ amenities <chr> "[\"Hangers\", \"Iron\", ~
$ price <chr> "$100.00", "$150.00", "$1~
$ minimum_nights <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights <dbl> 5, 730, 14, 730, 90, 30, ~
$ minimum_minimum_nights <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_minimum_nights <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ minimum_maximum_nights <dbl> 5, 730, 14, 1125, 90, 30,~
$ maximum_maximum_nights <dbl> 5, 730, 14, 1125, 90, 30,~
$ minimum_nights_avg_ntm <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights_avg_ntm <dbl> 5, 730, 14, 1125, 90, 30,~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 23, 28, 30, 0, 0, 23, 0, ~
$ availability_60 <dbl> 53, 58, 60, 0, 0, 53, 0, ~
$ availability_90 <dbl> 83, 88, 90, 0, 0, 83, 0, ~
$ availability_365 <dbl> 358, 363, 365, 0, 203, 35~
$ calendar_last_scraped <date> 2021-09-20, 2021-09-20, ~
$ number_of_reviews <dbl> 12, 15, 8, 34, 37, 14, 27~
$ number_of_reviews_ltm <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review <date> 2014-04-11, 2015-09-21, ~
$ last_review <date> 2010-04-19, 2020-09-07, ~
$ review_scores_rating <dbl> 4.89, 4.64, 4.71, 4.90, 4~
$ review_scores_accuracy <dbl> 5.00, 4.53, 4.71, 4.79, 4~
$ review_scores_cleanliness <dbl> 5.00, 4.40, 4.86, 4.90, 4~
$ review_scores_checkin <dbl> 5.00, 4.40, 4.86, 5.00, 5~
$ review_scores_communication <dbl> 5.00, 4.53, 4.86, 5.00, 4~
$ review_scores_location <dbl> 4.56, 4.53, 5.00, 5.00, 4~
$ review_scores_value <dbl> 4.67, 4.53, 5.00, 4.59, 4~
$ license <chr> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable <lgl> FALSE, FALSE, FALSE, TRUE~
$ calculated_host_listings_count <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_entire_homes <dbl> 0, 1, 0, 1, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_private_rooms <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 0.13, 0.21, 0.11, 0.47, 0~
glimpse(listings$price) # When we focus on prices, we can see that they're expressed is US dollars and are not categorized as numeric variable due to the presence of the "$" sign in front of each value chr [1:17703] "$100.00" "$150.00" "$180.00" "$55.00" "$75.00" "$180.00" ...
skim(listings) # There are several variables with complete_rate that is not close to 1, meaning that there are many NAs encountered throughout the 17,703 observations| Name | listings |
| Number of rows | 17703 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 24 |
| Date | 5 |
| logical | 8 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 17703 | 0 |
| name | 10 | 1.00 | 1 | 255 | 0 | 17460 | 0 |
| description | 415 | 0.98 | 1 | 1000 | 0 | 16920 | 0 |
| neighborhood_overview | 7400 | 0.58 | 1 | 1000 | 0 | 9456 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 17502 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 11915 | 0 |
| host_name | 1 | 1.00 | 1 | 35 | 0 | 2861 | 0 |
| host_location | 42 | 1.00 | 2 | 82 | 0 | 919 | 0 |
| host_about | 7753 | 0.56 | 1 | 5519 | 0 | 5665 | 16 |
| host_response_time | 1 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 1 | 1.00 | 2 | 4 | 0 | 65 | 0 |
| host_acceptance_rate | 1 | 1.00 | 2 | 4 | 0 | 91 | 0 |
| host_thumbnail_url | 1 | 1.00 | 55 | 106 | 0 | 11847 | 0 |
| host_picture_url | 1 | 1.00 | 57 | 109 | 0 | 11847 | 0 |
| host_neighbourhood | 5044 | 0.72 | 4 | 33 | 0 | 84 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 291 | 0 |
| neighbourhood | 7400 | 0.58 | 12 | 83 | 0 | 44 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 28 | 0 | 86 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 52 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 17 | 1.00 | 6 | 16 | 0 | 25 | 0 |
| amenities | 0 | 1.00 | 2 | 1596 | 0 | 16087 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 513 | 0 |
| license | 16382 | 0.07 | 1 | 110 | 0 | 1122 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-19 | 2021-09-20 | 2021-09-20 | 2 |
| host_since | 1 | 1.00 | 2009-02-19 | 2021-09-14 | 2015-04-29 | 3289 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-19 | 2021-09-20 | 2021-09-20 | 2 |
| first_review | 4532 | 0.74 | 2011-04-18 | 2021-09-20 | 2018-07-27 | 2593 |
| last_review | 4532 | 0.74 | 2010-04-19 | 2021-09-20 | 2019-10-20 | 2042 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 1 | 1 | 0.17 | FAL: 14638, TRU: 3064 |
| host_has_profile_pic | 1 | 1 | 1.00 | TRU: 17627, FAL: 75 |
| host_identity_verified | 1 | 1 | 0.79 | TRU: 13901, FAL: 3801 |
| neighbourhood_group_cleansed | 17703 | 0 | NaN | : |
| bathrooms | 17703 | 0 | NaN | : |
| calendar_updated | 17703 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.99 | TRU: 17503, FAL: 200 |
| instant_bookable | 0 | 1 | 0.40 | FAL: 10657, TRU: 7046 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.670292e+07 | 15763634.89 | 6.400000e+03 | 1.232248e+07 | 2.712945e+07 | 4.037565e+07 | 5.234077e+07 | ▇▆▆▇▇ |
| scrape_id | 0 | 1.00 | 2.021092e+13 | 0.00 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 8.730714e+07 | 106936802.81 | 8.168000e+03 | 1.312530e+07 | 3.202901e+07 | 1.351926e+08 | 4.229571e+08 | ▇▂▁▁▁ |
| host_listings_count | 1 | 1.00 | 1.674000e+01 | 68.04 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 1.507000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 1 | 1.00 | 1.674000e+01 | 68.04 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 1.507000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 4.547000e+01 | 0.02 | 4.539000e+01 | 4.545000e+01 | 4.547000e+01 | 4.549000e+01 | 4.553000e+01 | ▁▂▇▇▁ |
| longitude | 0 | 1.00 | 9.190000e+00 | 0.03 | 9.070000e+00 | 9.170000e+00 | 9.190000e+00 | 9.210000e+00 | 9.280000e+00 | ▁▂▇▆▁ |
| accommodates | 0 | 1.00 | 3.000000e+00 | 1.49 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| bedrooms | 1394 | 0.92 | 1.250000e+00 | 0.57 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| beds | 172 | 0.99 | 1.760000e+00 | 1.15 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 5.510000e+00 | 27.14 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 3.823783e+04 | 5010544.58 | 1.000000e+00 | 2.900000e+01 | 3.650000e+02 | 1.125000e+03 | 6.666667e+08 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 5.390000e+00 | 26.93 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 6.000000e+00 | 27.20 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 3.837286e+04 | 5010543.57 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 6.666667e+08 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 3.839390e+04 | 5010543.41 | 1.000000e+00 | 3.100000e+01 | 1.125000e+03 | 1.125000e+03 | 6.666667e+08 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 5.730000e+00 | 27.06 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 3.838416e+04 | 5010543.48 | 1.000000e+00 | 3.100000e+01 | 1.125000e+03 | 1.125000e+03 | 6.666667e+08 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 8.510000e+00 | 11.02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.900000e+01 | 3.000000e+01 | ▇▁▁▁▂ |
| availability_60 | 0 | 1.00 | 2.099000e+01 | 23.57 | 0.000000e+00 | 0.000000e+00 | 7.000000e+00 | 4.600000e+01 | 6.000000e+01 | ▇▁▁▂▃ |
| availability_90 | 0 | 1.00 | 3.542000e+01 | 36.58 | 0.000000e+00 | 0.000000e+00 | 1.900000e+01 | 7.600000e+01 | 9.000000e+01 | ▇▁▂▂▅ |
| availability_365 | 0 | 1.00 | 1.408300e+02 | 135.32 | 0.000000e+00 | 0.000000e+00 | 9.000000e+01 | 2.680000e+02 | 3.650000e+02 | ▇▃▂▂▅ |
| number_of_reviews | 0 | 1.00 | 2.493000e+01 | 59.78 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 2.000000e+01 | 9.140000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 2.620000e+00 | 7.56 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.230000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 4.900000e-01 | 1.37 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.900000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 4532 | 0.74 | 4.580000e+00 | 0.83 | 0.000000e+00 | 4.540000e+00 | 4.810000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 4829 | 0.73 | 4.760000e+00 | 0.43 | 0.000000e+00 | 4.700000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 4828 | 0.73 | 4.700000e+00 | 0.49 | 0.000000e+00 | 4.600000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 4830 | 0.73 | 4.820000e+00 | 0.40 | 0.000000e+00 | 4.800000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 4828 | 0.73 | 4.830000e+00 | 0.40 | 0.000000e+00 | 4.820000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 4830 | 0.73 | 4.710000e+00 | 0.41 | 0.000000e+00 | 4.600000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 4830 | 0.73 | 4.610000e+00 | 0.49 | 0.000000e+00 | 4.500000e+00 | 4.720000e+00 | 4.900000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 1.305000e+01 | 44.33 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 3.130000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.232000e+01 | 44.33 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.130000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 6.500000e-01 | 2.22 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.800000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 5.000000e-02 | 0.45 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 8.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 4532 | 0.74 | 9.200000e-01 | 1.38 | 1.000000e-02 | 1.000000e-01 | 3.600000e-01 | 1.070000e+00 | 1.500000e+01 | ▇▁▁▁▁ |
skim(listings$price) # There are 513 different unique price fares for the whole Milan neighbourhood| Name | listings$price |
| Number of rows | 17703 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 5 | 10 | 0 | 513 | 0 |
favstats(listings$beds) # AirBnBs in Milan have on average of 1.75 beds with more than 50% of accommodations having 1 bed. The observations with bed number equal to zero might be simple studios for daily rent or other types of accommodations not including a bed| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 16 | 1.76 | 1.15 | 17531 | 172 |
ggplot(data = listings) +
geom_histogram(mapping = aes(x = beds), binwidth = 1) +
labs(title = "Distribution number of beds") +
ylab("Count") +
xlab("Beds") +
scale_y_continuous(breaks = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000)) # The distribution is evidently right-skewedBelow we proceed with data visualizations:
#Example on visualizing distribution and number of occurrences for categorical variables (character, factor). Open graph in new window in order to visualize the right proportions
ggplot(data = listings) +
geom_bar(mapping = aes(y = neighbourhood_cleansed, fill = neighbourhood_cleansed)) +
labs(title = "Number of listings per neighbourhood") +
xlab("Count") +
ylab("Neighbourhoods") +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1500), breaks = c(250, 500, 750, 1000, 1250)) +
theme(legend.position = "none",
axis.text.x = element_text(size = 3),
axis.text.y = element_text(size = 3))listings %>%
count(neighbourhood_cleansed)| neighbourhood_cleansed | n |
|---|---|
| ADRIANO | 44 |
| AFFORI | 130 |
| BAGGIO | 46 |
| BANDE NERE | 309 |
| BARONA | 41 |
| BICOCCA | 48 |
| BOVISA | 188 |
| BOVISASCA | 9 |
| BRERA | 693 |
| BRUZZANO | 41 |
| BUENOS AIRES - VENEZIA | 1341 |
| CANTALUPA | 5 |
| CENTRALE | 622 |
| CHIARAVALLE | 4 |
| CITTA' STUDI | 434 |
| COMASINA | 25 |
| CORSICA | 202 |
| DE ANGELI - MONTE ROSA | 295 |
| DERGANO | 236 |
| DUOMO | 992 |
| EX OM - MORIVIONE | 95 |
| FARINI | 103 |
| FIGINO | 6 |
| FORZE ARMATE | 57 |
| GALLARATESE | 122 |
| GARIBALDI REPUBBLICA | 227 |
| GHISOLFA | 225 |
| GIAMBELLINO | 302 |
| GIARDINI PORTA VENEZIA | 19 |
| GRATOSOGLIO - TICINELLO | 31 |
| GRECO | 126 |
| GUASTALLA | 377 |
| ISOLA | 621 |
| LAMBRATE | 106 |
| LODI - CORVETTO | 304 |
| LORENTEGGIO | 60 |
| LORETO | 733 |
| MACIACHINI - MAGGIOLINA | 254 |
| MAGENTA - S. VITTORE | 432 |
| MAGGIORE - MUSOCCO | 62 |
| MECENATE | 76 |
| MUGGIANO | 3 |
| NAVIGLI | 836 |
| NIGUARDA - CA' GRANDA | 120 |
| ORTOMERCATO | 30 |
| PADOVA | 309 |
| PAGANO | 186 |
| PARCO AGRICOLO SUD | 4 |
| PARCO BOSCO IN CITT | 5 |
| PARCO DEI NAVIGLI | 8 |
| PARCO DELLE ABBAZIE | 9 |
| PARCO FORLANINI - ORTICA | 46 |
| PARCO LAMBRO - CIMIANO | 145 |
| PARCO MONLUE' - PONTE LAMBRO | 21 |
| PARCO NORD | 5 |
| PARCO SEMPIONE | 24 |
| PORTA ROMANA | 366 |
| PORTELLO | 89 |
| QT 8 | 41 |
| QUARTO CAGNINO | 27 |
| QUARTO OGGIARO | 58 |
| QUINTO ROMANO | 13 |
| QUINTOSOLE | 2 |
| RIPAMONTI | 90 |
| ROGOREDO | 35 |
| RONCHETTO DELLE RANE | 1 |
| RONCHETTO SUL NAVIGLIO | 100 |
| S. CRISTOFORO | 253 |
| S. SIRO | 101 |
| SACCO | 7 |
| SARPI | 795 |
| SCALO ROMANA | 136 |
| SELINUNTE | 145 |
| STADERA | 245 |
| TIBALDI | 266 |
| TICINESE | 732 |
| TORTONA | 415 |
| TRE TORRI | 25 |
| TRENNO | 12 |
| TRIULZO SUPERIORE | 14 |
| UMBRIA - MOLISE | 231 |
| VIALE MONZA | 216 |
| VIGENTINA | 256 |
| VILLAPIZZONE | 330 |
| WASHINGTON | 373 |
| XXII MARZO | 535 |
#Example on visualizing distribution and number of occurrences for continuous variables (numeric, dates)
ggplot(data = listings) +
geom_histogram(mapping = aes(x = host_since), binwidth = 200) +
labs(title = "Distribution of host registration year on AirBnB") +
xlab("Year of registration") +
ylab("Count")The distribution of host registration is close to normal, with a fat right tail. Most hosts in the city of Milan registered on AirBnB in 2015, with hosts starting to regiter on the platform from 2010 and diminishing around 2020 due to COVID-19.
Values around 2010 are rare because AirBnB was not that popular at the time. In fact, AirBnB was founded in 2008. Thus, it matches our expectations, since the company became more popular starting from 2013.
Not spotting any unusual pattern in addition to what has already been stated. No particular clusters to spot
The matrix drawn below gives us an idea of the correlation among variables, filtered by type of room as well. The correlations between variables are generally very different
For instance, in terms of those variables that are positively correlated, people and prices are positively correlated (coefficient of 0.311). The correlation for hotel rooms is higher (0.311), while that for shared rooms is lowest (0.046)
Host year and minimum nights are negatively correlated (coefficient of -0.084), that is, the minimum nights required for a booking by hosts over time have gradually diminished. This statements makes perfect sense in a competitive environement such as AirBnB. There is an exception within this analysis: hotel room types have an opposite trend, that is, minimum nights required have gradually increased with the passing of years
# Parsing price variable and filtering based on rational numbers for AirBnB prices
listings_1 <- listings %>%
mutate(price = parse_number(price))
listings_filtered <- listings_1 %>%
filter(beds < 10 & minimum_nights <= 10 & price <= 250)
# Choosing variables of our interest in order to draw conclusions on relationships between them according to the previously stated filters and
ggpairs(listings_filtered,
mapping = aes(col = room_type),
columns = c("host_since", "price", "minimum_nights", "longitude", "accommodates", "beds", "bedrooms"),
columnLabels = c("Host year", "Price", "Minimum nights","Longitude", "People", "Beds", "Bedrooms")) +
theme_bw() At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!
In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.
Once you load the data, it’s always a good idea to use glimpse to see what kind of variables you have and what data type (chr, num, logical, date, etc) they are.
Notice that some of the price data (price) is given as a character string, e.g., “$176.00”
Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number.
Use typeof(listing$price) to confirm that price is now stored as a number.
# Investigating the price variable has been already done above: cleaning currency in front of each value -> remember not to run parsing more than once otherwise you should start from scratch.
#Checking that price observations are now stored as numeric, i.e. "double"
typeof(listings_1$price)[1] "double"
# Checking on the distribution of prices: right-skewed, so we introduce a new variable to better capture price range in Milan and thus check its behavior
ggplot(data = listings_1) +
geom_histogram(mapping = aes(x = price), binwidth = 20) +
labs(title = "Distribution of prices") +
xlab("Price") +
ylab("Count")listings_2 <- listings_1 %>%
mutate(logprice = 4*log(listings_1$price))
ggplot(data = listings_2) +
geom_histogram(mapping = aes(x = logprice), binwidth = 1) +
labs(title = "Distribution of prices") +
xlab("Price") +
ylab("Count")# Detecting outliers and getting rid of them (negative prices, zero prices and extreme prices). A good range of data based on the below box plot seems between 10 and 25. Let's dig better into specific numbers with using the IQR rule and clean up outliers.
ggplot(data = listings_2) +
geom_boxplot(mapping = aes(x = logprice)) +
labs(title = "Detecting outliers")favstats(listings_2$logprice)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 8.79 | 16 | 17.5 | 19.5 | 37.6 | 17.9 | 2.83 | 17703 | 0 |
Q1 <- quantile(listings_2$logprice, .25)
Q3 <- quantile(listings_2$logprice, .75)
IQR <- IQR(listings_2$logprice)
no_outliers <- subset(listings_2, listings_2$logprice > (Q1 - 1.5*IQR) & listings_2$logprice < (Q3 + 1.5*IQR))
ggplot(data = no_outliers) +
geom_boxplot(mapping = aes(x = logprice)) +
labs(title = "Cleaned outliers")Next, we look at the variable property_type. We can use the count function to determine how many categories there are their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?
skim(listings$property_type) # There are 52 unique property types among the observations| Name | listings$property_type |
| Number of rows | 17703 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 4 | 35 | 0 | 52 | 0 |
# Properties ordered by frequency: the top 4 most common property types are "Entire rental unit", "Private room in rental unit", "Entire condominium (condo)", "Entire loft". These top 4 properties make up more than 85% of total properties types.
property_count <- listings %>%
count(property_type) %>%
arrange(desc(n))
property_count %>%
mutate(property_sum = sum(property_count$n),
property_perc = n / property_sum)| property_type | n | property_sum | property_perc |
|---|---|---|---|
| Entire rental unit | 10178 | 17703 | 0.575 |
| Private room in rental unit | 2661 | 17703 | 0.15 |
| Entire condominium (condo) | 1830 | 17703 | 0.103 |
| Entire loft | 833 | 17703 | 0.0471 |
| Private room in condominium (condo) | 631 | 17703 | 0.0356 |
| Entire residential home | 274 | 17703 | 0.0155 |
| Entire serviced apartment | 189 | 17703 | 0.0107 |
| Shared room in rental unit | 182 | 17703 | 0.0103 |
| Private room in residential home | 168 | 17703 | 0.00949 |
| Private room in bed and breakfast | 137 | 17703 | 0.00774 |
| Private room in loft | 91 | 17703 | 0.00514 |
| Room in boutique hotel | 56 | 17703 | 0.00316 |
| Room in hotel | 56 | 17703 | 0.00316 |
| Private room in serviced apartment | 36 | 17703 | 0.00203 |
| Shared room in condominium (condo) | 33 | 17703 | 0.00186 |
| Private room in villa | 30 | 17703 | 0.00169 |
| Tiny house | 26 | 17703 | 0.00147 |
| Room in aparthotel | 24 | 17703 | 0.00136 |
| Entire guest suite | 22 | 17703 | 0.00124 |
| Private room in guest suite | 21 | 17703 | 0.00119 |
| Private room in townhouse | 20 | 17703 | 0.00113 |
| Entire villa | 19 | 17703 | 0.00107 |
| Room in bed and breakfast | 19 | 17703 | 0.00107 |
| Entire townhouse | 18 | 17703 | 0.00102 |
| Private room in hostel | 17 | 17703 | 0.00096 |
| Room in serviced apartment | 17 | 17703 | 0.00096 |
| Shared room in hostel | 17 | 17703 | 0.00096 |
| Room in hostel | 12 | 17703 | 0.000678 |
| Private room | 10 | 17703 | 0.000565 |
| Entire place | 9 | 17703 | 0.000508 |
| Shared room in loft | 9 | 17703 | 0.000508 |
| Private room in tiny house | 8 | 17703 | 0.000452 |
| Shared room in residential home | 8 | 17703 | 0.000452 |
| Private room in casa particular | 5 | 17703 | 0.000282 |
| Shared room in bed and breakfast | 5 | 17703 | 0.000282 |
| Private room in guesthouse | 4 | 17703 | 0.000226 |
| Casa particular | 3 | 17703 | 0.000169 |
| Entire bed and breakfast | 3 | 17703 | 0.000169 |
| Entire guesthouse | 3 | 17703 | 0.000169 |
| Entire home/apt | 3 | 17703 | 0.000169 |
| Shared room in tiny house | 3 | 17703 | 0.000169 |
| Camper/RV | 2 | 17703 | 0.000113 |
| Dome house | 2 | 17703 | 0.000113 |
| Boat | 1 | 17703 | 5.65e-05 |
| Cave | 1 | 17703 | 5.65e-05 |
| Earth house | 1 | 17703 | 5.65e-05 |
| Island | 1 | 17703 | 5.65e-05 |
| Private room in camper/rv | 1 | 17703 | 5.65e-05 |
| Private room in cave | 1 | 17703 | 5.65e-05 |
| Private room in earth house | 1 | 17703 | 5.65e-05 |
| Private room in farm stay | 1 | 17703 | 5.65e-05 |
| Tipi | 1 | 17703 | 5.65e-05 |
Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.
listings_3 <- listings_2 %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in rental unit", "Entire condominium (condo)","Entire loft") ~ property_type,
TRUE ~ "Other"
))Use the code below to check that prop_type_simplified was correctly made.
listings_3 %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 10178 |
| Private room in rental unit | Private room in rental unit | 2661 |
| Entire condominium (condo) | Entire condominium (condo) | 1830 |
| Entire loft | Entire loft | 833 |
| Private room in condominium (condo) | Other | 631 |
| Entire residential home | Other | 274 |
| Entire serviced apartment | Other | 189 |
| Shared room in rental unit | Other | 182 |
| Private room in residential home | Other | 168 |
| Private room in bed and breakfast | Other | 137 |
| Private room in loft | Other | 91 |
| Room in boutique hotel | Other | 56 |
| Room in hotel | Other | 56 |
| Private room in serviced apartment | Other | 36 |
| Shared room in condominium (condo) | Other | 33 |
| Private room in villa | Other | 30 |
| Tiny house | Other | 26 |
| Room in aparthotel | Other | 24 |
| Entire guest suite | Other | 22 |
| Private room in guest suite | Other | 21 |
| Private room in townhouse | Other | 20 |
| Entire villa | Other | 19 |
| Room in bed and breakfast | Other | 19 |
| Entire townhouse | Other | 18 |
| Private room in hostel | Other | 17 |
| Room in serviced apartment | Other | 17 |
| Shared room in hostel | Other | 17 |
| Room in hostel | Other | 12 |
| Private room | Other | 10 |
| Entire place | Other | 9 |
| Shared room in loft | Other | 9 |
| Private room in tiny house | Other | 8 |
| Shared room in residential home | Other | 8 |
| Private room in casa particular | Other | 5 |
| Shared room in bed and breakfast | Other | 5 |
| Private room in guesthouse | Other | 4 |
| Casa particular | Other | 3 |
| Entire bed and breakfast | Other | 3 |
| Entire guesthouse | Other | 3 |
| Entire home/apt | Other | 3 |
| Shared room in tiny house | Other | 3 |
| Camper/RV | Other | 2 |
| Dome house | Other | 2 |
| Boat | Other | 1 |
| Cave | Other | 1 |
| Earth house | Other | 1 |
| Island | Other | 1 |
| Private room in camper/rv | Other | 1 |
| Private room in cave | Other | 1 |
| Private room in earth house | Other | 1 |
| Private room in farm stay | Other | 1 |
| Tipi | Other | 1 |
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:
minimum_nights?The most common values for the variable are 1, 2 and 3 nights.
There are observations which should obviously be deleted from the dataset, such as nights > 365 (that is, greater than a year). It may still be possible that some people booked an AirBnB for 1 year so we would rather keep that information
minimum_nights?The likely intended purpose for those AirBnBs with long bookings is long-term rental.
Filter the airbnb data so that it only includes observations with minimum_nights <= 4
min_nights_count <- listings_3 %>%
count(minimum_nights) %>%
arrange(minimum_nights)
ggplot(data = min_nights_count) +
geom_histogram(mapping = aes(x = minimum_nights))# Filtering bookings for nights below 4
min_nights_count_1 <- listings_3 %>%
count(minimum_nights) %>%
filter(minimum_nights <= 4)Visualizations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualizations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.
The following code, having downloaded a dataframe listings with all AirbnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4). You could learn more about leaflet, by following the relevant Datacamp course on mapping with leaflet
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
Create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
listings_regr <- listings_3 %>%
filter(minimum_nights == 4 & accommodates == 2) %>%
mutate(price_4_nights = 4 * (price * accommodates),
log_price_4_nights = log(price_4_nights)) # Variable that we want to explainUse histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?
We should use log_price_4_nights given the more even distribution of prices, which in the case of price_4_nights is right-skewed and does not allow us to interpret data well.
skim(listings_regr$price_4_nights)| Name | listings_regr$price_4_nig… |
| Number of rows | 274 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 906.34 | 805.04 | 160 | 410 | 640 | 1200 | 8000 | ▇▁▁▁▁ |
# First histogram for price_4_nights: distribution difficult to interpret, hence we draw a log scale for the variable
ggplot(data = listings_regr) +
geom_histogram(mapping = aes(x = price_4_nights))# Second histogram for log(price_4_nights)
ggplot(data = listings_regr) +
geom_histogram(mapping = aes(x = log_price_4_nights)) +
labs(title = "Distribution of prices per four nights")Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
# Fitting regression model 1
model1 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = listings_regr)
summary(model1)
Call:
lm(formula = log_price_4_nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = listings_regr)
Residuals:
Min 1Q Median 3Q Max
-1.1349 -0.4711 -0.1103 0.4382 1.7703
Coefficients:
Estimate Std. Error t value
(Intercept) 6.456708 0.244629 26.394
prop_type_simplifiedEntire loft -0.052152 0.230220 -0.227
prop_type_simplifiedEntire rental unit -0.015456 0.143840 -0.107
prop_type_simplifiedOther -0.074000 0.202877 -0.365
prop_type_simplifiedPrivate room in rental unit -0.323364 0.167271 -1.933
number_of_reviews -0.003026 0.001068 -2.834
review_scores_rating 0.027319 0.043202 0.632
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire loft 0.82104
prop_type_simplifiedEntire rental unit 0.91455
prop_type_simplifiedOther 0.71572
prop_type_simplifiedPrivate room in rental unit 0.05476 .
number_of_reviews 0.00511 **
review_scores_rating 0.52794
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5989 on 183 degrees of freedom
(因为不存在,84个观察量被删除了)
Multiple R-squared: 0.07208, Adjusted R-squared: 0.04166
F-statistic: 2.369 on 6 and 183 DF, p-value: 0.03149
autoplot(model1)vif(model1) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.036657 4 1.004510
number_of_reviews 1.025684 1 1.012760
review_scores_rating 1.028962 1 1.014378
Theory section: The higher the t-value, the greater the confidence we have in the coefficient as a predictor. Low t-values are indications of low reliability of the predictive power of that coefficient.
review_scores_rating in terms of price_4_nights.The coefficient “review_scores_rating” has a t-statistic of 0.632, therefore it is not significant in explaining the variability of log_price_4_nights. Hence, we can get rid of it.
prop_type_simplified in terms of price_4_nights.The type of property is not a good predictor of the variability of prices given the very low t-values, therefore we cannot reject the null hypothesis. This means property type does not have significant explanatory power and we can disregard it as well.
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
# Fitting regression model 2
model2 <- lm(log_price_4_nights ~ number_of_reviews + room_type, data = listings_regr) # Removed variables that were not statistically significant in model1 and added "room_type"
summary(model2)
Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type,
data = listings_regr)
Residuals:
Min 1Q Median 3Q Max
-1.62135 -0.54055 -0.05297 0.50385 2.29068
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.696519 0.047197 141.884 < 2e-16 ***
number_of_reviews -0.004093 0.001090 -3.754 0.000213 ***
room_typePrivate room -0.341134 0.092837 -3.675 0.000287 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6421 on 271 degrees of freedom
Multiple R-squared: 0.08816, Adjusted R-squared: 0.08143
F-statistic: 13.1 on 2 and 271 DF, p-value: 3.707e-06
autoplot(model2)vif(model2)number_of_reviews room_type
1.002828 1.002828
Model2 suggests that the type of rooms is a significant predictor of the variability of prices (t-value -3.675)
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?
Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
All the above variables were plugged in model3 in addition to those already existing. There doesn’t seem to be enough explanatory power in the newly added variables, so they can be dropped
# Sourcing the number of bathrooms per property from "bathrooms_text" and analyzing regression model
listings_regr_1 <- listings_regr %>%
mutate(bathrooms_clean = parse_number(bathrooms_text))
model3 <- lm(log_price_4_nights ~ number_of_reviews + room_type + bathrooms_clean + bedrooms + beds + host_is_superhost, data = listings_regr_1)
summary(model3)
Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type +
bathrooms_clean + bedrooms + beds + host_is_superhost, data = listings_regr_1)
Residuals:
Min 1Q Median 3Q Max
-1.65989 -0.51046 -0.03709 0.49041 2.25213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.608930 0.279486 23.647 < 2e-16 ***
number_of_reviews -0.003904 0.001384 -2.822 0.00519 **
room_typePrivate room -0.413553 0.099634 -4.151 4.66e-05 ***
bathrooms_clean 0.049490 0.179039 0.276 0.78247
bedrooms 0.013302 0.204195 0.065 0.94812
beds 0.063346 0.091425 0.693 0.48908
host_is_superhostTRUE -0.077034 0.148578 -0.518 0.60462
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6467 on 231 degrees of freedom
(因为不存在,36个观察量被删除了)
Multiple R-squared: 0.107, Adjusted R-squared: 0.08385
F-statistic: 4.615 on 6 and 231 DF, p-value: 0.0001894
autoplot(model3)vif(model3)number_of_reviews room_type bathrooms_clean bedrooms
1.198443 1.065264 1.042983 1.062875
beds host_is_superhost
1.102542 1.222614
instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?The option to book immediately an AirBnB listing does not have significant predictive power on price
model4 <- lm(log_price_4_nights ~ number_of_reviews + room_type + instant_bookable, data = listings_regr_1)
summary(model4)
Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type +
instant_bookable, data = listings_regr_1)
Residuals:
Min 1Q Median 3Q Max
-1.58490 -0.51537 -0.05437 0.48198 2.15328
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.660069 0.050740 131.258 < 2e-16 ***
number_of_reviews -0.004069 0.001085 -3.749 0.000217 ***
room_typePrivate room -0.363828 0.093163 -3.905 0.000119 ***
instant_bookableTRUE 0.173846 0.091524 1.899 0.058570 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.639 on 270 degrees of freedom
Multiple R-squared: 0.1002, Adjusted R-squared: 0.09018
F-statistic: 10.02 on 3 and 270 DF, p-value: 2.778e-06
autoplot(model4)vif(model4)number_of_reviews room_type instant_bookable
1.002963 1.019597 1.017063
neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nightsLocation is a reliable predictor of price_4_nights as t-values are big enough for the east, north and south. The estimates of these three variables are approximately -0.5, indicating that the property price in these there areas are significantly cheaper than the baseline (central area). Although the west variable is not significant, it just indicates that the price of the west area is basically the same as that of the baseline (central area). Therefore, we still need to keep the neibourhood variable to help predict the total cost.
listings_regr_2 <- listings_regr_1
listings_regr_2$neighbourhood_simplified <- ifelse(
listings_regr_2$neighbourhood_cleansed == "ADRIANO" |
listings_regr_2$neighbourhood_cleansed == "AFFORI" |
listings_regr_2$neighbourhood_cleansed == "BICOCCA" |
listings_regr_2$neighbourhood_cleansed == "BOVISA" |
listings_regr_2$neighbourhood_cleansed == "BOVISASCA"|
listings_regr_2$neighbourhood_cleansed == "BRUZZANO" |
listings_regr_2$neighbourhood_cleansed == "COMASINA" |
listings_regr_2$neighbourhood_cleansed == "DERGANO" |
listings_regr_2$neighbourhood_cleansed == "FARINI" |
listings_regr_2$neighbourhood_cleansed == "GHISOLFA" |
listings_regr_2$neighbourhood_cleansed == "GRECO" |
listings_regr_2$neighbourhood_cleansed == "LORETO" |
listings_regr_2$neighbourhood_cleansed == "MACIACHINI - MAGGIOLINA" |
listings_regr_2$neighbourhood_cleansed == "MAGGIORE - MUSOCCO" |
listings_regr_2$neighbourhood_cleansed == "NIGUARDA - CA' GRANDA" |
listings_regr_2$neighbourhood_cleansed == "PADOVA" |
listings_regr_2$neighbourhood_cleansed == "PAGANO" |
listings_regr_2$neighbourhood_cleansed == "PARCO LAMBRO - CIMIANO" |
listings_regr_2$neighbourhood_cleansed == "PARCO NORD" |
listings_regr_2$neighbourhood_cleansed == "QUARTO OGGIARO" |
listings_regr_2$neighbourhood_cleansed == "VIALE MONZA" |
listings_regr_2$neighbourhood_cleansed == "VILLAPIZZONE", "NORTH",
ifelse(
listings_regr_2$neighbourhood_cleansed == "BAGGIO" |
listings_regr_2$neighbourhood_cleansed == "PORTELLO" |
listings_regr_2$neighbourhood_cleansed == "QT 8" |
listings_regr_2$neighbourhood_cleansed == "QUARTO CAGNINO" |
listings_regr_2$neighbourhood_cleansed == "QUINTO ROMANO" |
listings_regr_2$neighbourhood_cleansed == "BANDE NERE" |
listings_regr_2$neighbourhood_cleansed == "FIGINO" |
listings_regr_2$neighbourhood_cleansed == "GIAMBELLINO" |
listings_regr_2$neighbourhood_cleansed == "LORENTEGGIO" |
listings_regr_2$neighbourhood_cleansed == "MUGGIANO" |
listings_regr_2$neighbourhood_cleansed == "PARCO BOSCO IN CITT\u0085" |
listings_regr_2$neighbourhood_cleansed == "S. CRISTOFORO" |
listings_regr_2$neighbourhood_cleansed == "S. SIRO" |
listings_regr_2$neighbourhood_cleansed == "SACCO" |
listings_regr_2$neighbourhood_cleansed == "SELINUNTE" |
listings_regr_2$neighbourhood_cleansed == "TRE TORRI" |
listings_regr_2$neighbourhood_cleansed == "TRENNO", "WEST",
ifelse(
listings_regr_2$neighbourhood_cleansed == "BARONA" |
listings_regr_2$neighbourhood_cleansed == "CANTALUPA" |
listings_regr_2$neighbourhood_cleansed == "CHIARAVALLE" |
listings_regr_2$neighbourhood_cleansed == "EX OM - MORIVIONE" |
listings_regr_2$neighbourhood_cleansed == "GRATOSOGLIO - TICINELLO" |
listings_regr_2$neighbourhood_cleansed == "LODI - CORVETTO" |
listings_regr_2$neighbourhood_cleansed == "PARCO AGRICOLO SUD" |
listings_regr_2$neighbourhood_cleansed == "PARCO DEI NAVIGLI" |
listings_regr_2$neighbourhood_cleansed == "PARCO DELLE ABBAZIE" |
listings_regr_2$neighbourhood_cleansed == "QUINTOSOLE" |
listings_regr_2$neighbourhood_cleansed == "RIPAMONTI" |
listings_regr_2$neighbourhood_cleansed == "RONCHETTO SUL NAVIGLIO" |
listings_regr_2$neighbourhood_cleansed == "SCALO ROMANA" |
listings_regr_2$neighbourhood_cleansed == "STADERA" |
listings_regr_2$neighbourhood_cleansed == "TIBALDI" |
listings_regr_2$neighbourhood_cleansed == "TRIULZO SUPERIORE" |
listings_regr_2$neighbourhood_cleansed == "VIGENTINA", "SOUTH",
ifelse(
listings_regr_2$neighbourhood_cleansed == "CITTA' STUDI" |
listings_regr_2$neighbourhood_cleansed == "CORSICA" |
listings_regr_2$neighbourhood_cleansed == "FORZE ARMATE" |
listings_regr_2$neighbourhood_cleansed == "GALLARATESE" |
listings_regr_2$neighbourhood_cleansed == "LAMBRATE" |
listings_regr_2$neighbourhood_cleansed == "MECENATE" |
listings_regr_2$neighbourhood_cleansed == "ORTOMERCATO" |
listings_regr_2$neighbourhood_cleansed == "PARCO FORLANINI - ORTICA" |
listings_regr_2$neighbourhood_cleansed == "PARCO MONLUE' - PONTE LAMBRO" |
listings_regr_2$neighbourhood_cleansed == "ROGOREDO" |
listings_regr_2$neighbourhood_cleansed == "UMBRIA - MOLISE", "EAST",
ifelse(
listings_regr_2$neighbourhood_cleansed == "BRERA" |
listings_regr_2$neighbourhood_cleansed == "BUENOS AIRES - VENEZIA" |
listings_regr_2$neighbourhood_cleansed == "CENTRALE" |
listings_regr_2$neighbourhood_cleansed == "DE ANGELI - MONTE ROSA" |
listings_regr_2$neighbourhood_cleansed == "DUOMO" |
listings_regr_2$neighbourhood_cleansed == "GARIBALDI REPUBBLICA" |
listings_regr_2$neighbourhood_cleansed == "GIARDINI PORTA VENEZIA" |
listings_regr_2$neighbourhood_cleansed == "GUASTALLA" |
listings_regr_2$neighbourhood_cleansed == "ISOLA" |
listings_regr_2$neighbourhood_cleansed == "MAGENTA - S. VITTORE" |
listings_regr_2$neighbourhood_cleansed == "NAVIGLI" |
listings_regr_2$neighbourhood_cleansed == "PARCO SEMPIONE" |
listings_regr_2$neighbourhood_cleansed == "PORTA ROMANA" |
listings_regr_2$neighbourhood_cleansed == "SARPI" |
listings_regr_2$neighbourhood_cleansed == "TICINESE" |
listings_regr_2$neighbourhood_cleansed == "TORTONA" |
listings_regr_2$neighbourhood_cleansed == "WASHINGTON" |
listings_regr_2$neighbourhood_cleansed == "XXII MARZO", "CENTRAL", "-")))))
skim(listings_regr_2$neighbourhood_simplified)| Name | listings_regr_2$neighbour… |
| Number of rows | 274 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 4 | 7 | 0 | 5 | 0 |
model5 <- lm(log_price_4_nights ~ number_of_reviews + room_type + neighbourhood_simplified, data = listings_regr_2)
summary(model5)
Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type +
neighbourhood_simplified, data = listings_regr_2)
Residuals:
Min 1Q Median 3Q Max
-1.20515 -0.45724 -0.04588 0.37618 2.11966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.866157 0.053392 128.598 < 2e-16 ***
number_of_reviews -0.003668 0.001020 -3.596 0.000384 ***
room_typePrivate room -0.280723 0.087624 -3.204 0.001521 **
neighbourhood_simplifiedEAST -0.497822 0.132211 -3.765 0.000205 ***
neighbourhood_simplifiedNORTH -0.452489 0.094957 -4.765 3.10e-06 ***
neighbourhood_simplifiedSOUTH -0.585835 0.129768 -4.514 9.53e-06 ***
neighbourhood_simplifiedWEST 0.001378 0.140028 0.010 0.992155
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.599 on 267 degrees of freedom
Multiple R-squared: 0.2182, Adjusted R-squared: 0.2006
F-statistic: 12.42 on 6 and 267 DF, p-value: 2.445e-12
autoplot(model5)vif(model5) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.008025 1 1.004004
room_type 1.026607 1 1.013216
neighbourhood_simplified 1.028335 4 1.003499
avalability_30 or reviews_per_month on price_4_nights, after we control for other variables?Based on model6 below, coefficient “availability_30” is significant in explaining variability of prices, whereas “reviews_per_month” isn’t. For this reason further model6.1 and model 6.2 are computed, and as model 6.2 has a higher adjusted r² with all regressors significant, we think it outperforms model6.1.
model6 <- lm(log_price_4_nights ~ number_of_reviews + room_type + neighbourhood_simplified + availability_30 + reviews_per_month, data = listings_regr_2)
summary(model6)
Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type +
neighbourhood_simplified + availability_30 + reviews_per_month,
data = listings_regr_2)
Residuals:
Min 1Q Median 3Q Max
-0.95971 -0.33851 -0.06812 0.26488 1.71802
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6285326 0.0648658 102.188 < 2e-16 ***
number_of_reviews -0.0001663 0.0016725 -0.099 0.920881
room_typePrivate room -0.2794309 0.0905326 -3.087 0.002344 **
neighbourhood_simplifiedEAST -0.4050995 0.1218563 -3.324 0.001073 **
neighbourhood_simplifiedNORTH -0.4511803 0.0994412 -4.537 1.04e-05 ***
neighbourhood_simplifiedSOUTH -0.4671256 0.1241691 -3.762 0.000227 ***
neighbourhood_simplifiedWEST -0.3447407 0.1778660 -1.938 0.054154 .
availability_30 0.0192026 0.0034246 5.607 7.58e-08 ***
reviews_per_month -0.1198490 0.0964339 -1.243 0.215546
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5067 on 181 degrees of freedom
(因为不存在,84个观察量被删除了)
Multiple R-squared: 0.3429, Adjusted R-squared: 0.3138
F-statistic: 11.8 on 8 and 181 DF, p-value: 1.784e-13
autoplot(model6)vif(model6) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 3.514313 1 1.874650
room_type 1.061876 1 1.030473
neighbourhood_simplified 1.117805 4 1.014018
availability_30 1.032053 1 1.015900
reviews_per_month 3.650668 1 1.910672
# In model 6, after adding review_per_month, the number_of_reviews become not significant, indicating that there is a slight collinearity between the two variables. Hence, we need to compare the two variables.
model6.1 <- lm(log_price_4_nights ~ room_type + neighbourhood_simplified + availability_30 + reviews_per_month, data = listings_regr_2)
summary(model6.1)
Call:
lm(formula = log_price_4_nights ~ room_type + neighbourhood_simplified +
availability_30 + reviews_per_month, data = listings_regr_2)
Residuals:
Min 1Q Median 3Q Max
-0.9593 -0.3345 -0.0696 0.2640 1.7188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.628919 0.064573 102.658 < 2e-16 ***
room_typePrivate room -0.280332 0.089832 -3.121 0.002099 **
neighbourhood_simplifiedEAST -0.404990 0.121519 -3.333 0.001042 **
neighbourhood_simplifiedNORTH -0.450660 0.099033 -4.551 9.75e-06 ***
neighbourhood_simplifiedSOUTH -0.465345 0.122537 -3.798 0.000199 ***
neighbourhood_simplifiedWEST -0.346988 0.175945 -1.972 0.050109 .
availability_30 0.019210 0.003415 5.626 6.87e-08 ***
reviews_per_month -0.127902 0.052235 -2.449 0.015290 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5054 on 182 degrees of freedom
(因为不存在,84个观察量被删除了)
Multiple R-squared: 0.3428, Adjusted R-squared: 0.3175
F-statistic: 13.56 on 7 and 182 DF, p-value: 4.638e-14
autoplot(model6.1)vif(model6.1) GVIF Df GVIF^(1/(2*Df))
room_type 1.051233 1 1.025296
neighbourhood_simplified 1.070531 4 1.008556
availability_30 1.031628 1 1.015691
reviews_per_month 1.076982 1 1.037778
model6.2 <- lm(log_price_4_nights ~ number_of_reviews + room_type + neighbourhood_simplified + availability_30, data = listings_regr_2) # Model 6.2 has a higher adjusted r², therefore we keep this one
summary(model6.2)
Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type +
neighbourhood_simplified + availability_30, data = listings_regr_2)
Residuals:
Min 1Q Median 3Q Max
-1.03538 -0.35730 -0.07754 0.29912 2.40632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6738487 0.0549696 121.410 < 2e-16 ***
number_of_reviews -0.0025690 0.0009399 -2.733 0.006695 **
room_typePrivate room -0.2878995 0.0797799 -3.609 0.000368 ***
neighbourhood_simplifiedEAST -0.4410529 0.1206043 -3.657 0.000308 ***
neighbourhood_simplifiedNORTH -0.5011355 0.0866931 -5.781 2.08e-08 ***
neighbourhood_simplifiedSOUTH -0.5632983 0.1181807 -4.766 3.09e-06 ***
neighbourhood_simplifiedWEST -0.0929711 0.1281032 -0.726 0.468629
availability_30 0.0208242 0.0027794 7.492 1.00e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5453 on 266 degrees of freedom
Multiple R-squared: 0.3544, Adjusted R-squared: 0.3375
F-statistic: 20.86 on 7 and 266 DF, p-value: < 2.2e-16
autoplot(model6.2)vif(model6.2) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.033172 1 1.016451
room_type 1.026755 1 1.013289
neighbourhood_simplified 1.051702 4 1.006321
availability_30 1.048092 1 1.023764
As you keep building your models, it makes sense to:
autoplot(model_x).Done for each model
All models have low VIF
huxtable (https://mfa2022.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.huxreg(model1, model2, model3, model4, model5, model6.2)| (1) | (2) | (3) | (4) | (5) | (6) | |
|---|---|---|---|---|---|---|
| (Intercept) | 6.457 *** | 6.697 *** | 6.609 *** | 6.660 *** | 6.866 *** | 6.674 *** |
| (0.245) | (0.047) | (0.279) | (0.051) | (0.053) | (0.055) | |
| prop_type_simplifiedEntire loft | -0.052 | |||||
| (0.230) | ||||||
| prop_type_simplifiedEntire rental unit | -0.015 | |||||
| (0.144) | ||||||
| prop_type_simplifiedOther | -0.074 | |||||
| (0.203) | ||||||
| prop_type_simplifiedPrivate room in rental unit | -0.323 | |||||
| (0.167) | ||||||
| number_of_reviews | -0.003 ** | -0.004 *** | -0.004 ** | -0.004 *** | -0.004 *** | -0.003 ** |
| (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | |
| review_scores_rating | 0.027 | |||||
| (0.043) | ||||||
| room_typePrivate room | -0.341 *** | -0.414 *** | -0.364 *** | -0.281 ** | -0.288 *** | |
| (0.093) | (0.100) | (0.093) | (0.088) | (0.080) | ||
| bathrooms_clean | 0.049 | |||||
| (0.179) | ||||||
| bedrooms | 0.013 | |||||
| (0.204) | ||||||
| beds | 0.063 | |||||
| (0.091) | ||||||
| host_is_superhostTRUE | -0.077 | |||||
| (0.149) | ||||||
| instant_bookableTRUE | 0.174 | |||||
| (0.092) | ||||||
| neighbourhood_simplifiedEAST | -0.498 *** | -0.441 *** | ||||
| (0.132) | (0.121) | |||||
| neighbourhood_simplifiedNORTH | -0.452 *** | -0.501 *** | ||||
| (0.095) | (0.087) | |||||
| neighbourhood_simplifiedSOUTH | -0.586 *** | -0.563 *** | ||||
| (0.130) | (0.118) | |||||
| neighbourhood_simplifiedWEST | 0.001 | -0.093 | ||||
| (0.140) | (0.128) | |||||
| availability_30 | 0.021 *** | |||||
| (0.003) | ||||||
| N | 190 | 274 | 238 | 274 | 274 | 274 |
| R2 | 0.072 | 0.088 | 0.107 | 0.100 | 0.218 | 0.354 |
| logLik | -168.613 | -265.892 | -230.403 | -264.073 | -244.811 | -218.579 |
| AIC | 353.226 | 539.783 | 476.805 | 538.146 | 505.621 | 455.158 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||
price_4_nights.The best model according to the above huxtable is model6.2, which is the model with the highest R-squared (0.354) and regressors are not colinear with significant power to explain the Y, i.e. the model that best fits our observations.
After conducting antilog and point prediction, we draw a line plot to compare the prediction and real total costs. As considering our selection criteria, there are only 9 properties left, so we set the x-axis as the index/id of the property (0-8). We can see that the fitted line flctuates around the real line, which is within the range of 95% confidenc einterval.
# We filter the original dataset to get our potential choices
listings_regr_3 <- listings_regr_2 %>%
filter(room_type == "Private room" & number_of_reviews >= 10 & review_scores_rating >= 4.5)
# Then we create data frame containing variables the regression model needs
options <- listings_regr_3[c('number_of_reviews', 'room_type', 'neighbourhood_simplified', 'availability_30')]
# As stated before, we would choose model 6.2 as our best model to predict here
log_predictions <- predict(model6.2, options, interval = "confidence", level = 0.95)
# Antilog to get the total costs
predictions <- exp(log_predictions)
predictions fit lwr upr
1 578.3963 490.8187 681.6006
2 976.8259 741.8942 1286.1520
3 280.0911 213.6477 367.1980
4 323.4941 209.8639 498.6492
5 415.3426 346.1236 498.4043
6 548.0194 463.4663 647.9979
7 267.9445 189.4118 379.0378
8 342.4208 266.4219 440.0990
9 447.4150 357.8257 559.4349
# Plot the point predictions with confidence intervals and real values
ggplot(predictions, aes(y = fit, x = c(0,1,2,3,4,5,6,7,8))) +
geom_ribbon(aes(ymin = fit - lwr,
ymax = fit + upr,
xmin = 0,
xmax = 8),
fill = "lightblue") +
scale_x_continuous(expand = c(0, 0), limits = c(0, 8), breaks = c(1,2,3,4,5,6,7,8)) +
geom_line(data = listings_regr_3,
mapping = aes(y = price * 2 * 4, x = c(0,1,2,3,4,5,6,7,8),
color = 'black')) +
geom_line(data = predictions,
mapping = aes(y = fit, x = c(0,1,2,3,4,5,6,7,8),
color = 'red')) +
scale_colour_manual(labels = c('real', 'predict'), values = c('red', 'black')) +
labs(title = 'Predict and Real Total Costs for 2 Person 4 Night',
x = 'Index',
y = 'Total Cost') +
theme_bw() +
theme(legend.title=element_blank()) +
NULLprice_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.Remember to follow R Markdown etiquette rules and style; don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates; in other words, (i.e. I don’t want to see stuff I wrote in your final report.)
Your work will be assessed on a rubric which you can find here